#The grouping will be done visually
#we need to find groups and business groups, we have to do a bit of text sensing
TakeHomeEx3
Objective definition:
FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.
FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure. The research below aims to help FishEye develop a new visual analytics approach to better understand fishing business anomalies.
We will use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.
Task 1: Use visual analytics to identify anomalies in the business groups present in the knowledge graph.
Task 2: Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user.
Data Pre-processing and cleaning
Load the libraries and read the JSON relationship file MC3.
#| echo: false
#tidytext -- text mining library with R: https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
#Load Libraries
pacman::p_load(jsonlite,tidygraph, ggraph, visNetwork, tidyverse, shiny, plotly, graphlayouts, ggforce, tidytext,skimr)
#load Data
MC3 <- fromJSON("data/MC3.json")
Pick desired fields in MC3
We pick the desired fields and reorder the columns with select(). Each node in MC3 is a company or a person, described by its products and services, country, and revenue.
On loading the data, we found that the graph is undirected, so the in/out direction of each connection is unknown.
The code below extracts the nodes and edges for further processing.
#glimpse(MC3)
MC3_nodes <- as_tibble(MC3$nodes)
colSums(is.na(MC3_nodes))
         country               id product_services      revenue_omu             type
               0                0                0                0                0
#extract and mutate the format so it's not list but dataframe
MC3_nodes_clean <- MC3_nodes %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)), #we need to convert to numeric directly
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `revenue_omu = as.numeric(as.character(revenue_omu))`.
Caused by warning:
! NAs introduced by coercion
The original data contain no NA values. However, converting revenue_omu to numeric yields NA for every entry that cannot be parsed as a number, as the coercion warning indicates.
#check data quality
colSums(is.na(MC3_nodes_clean))
              id          country             type      revenue_omu product_services
               0                0                0            21515                0
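The coercion warning and the 21515 missing revenue values can be reproduced on a toy list. This is only a sketch: it assumes that revenue_omu arrives from the JSON as a list column in which some entries are empty, and the values in `revenue_raw` are made up for illustration.

```r
# Toy list column (hypothetical values) mimicking a JSON field with an empty entry
revenue_raw <- list(5000.5, character(0), "12000")

# as.character() on a list renders the empty element as the literal string "character(0)"
revenue_chr <- as.character(revenue_raw)

# as.numeric() cannot parse that string, so the entry becomes NA
# (R emits "NAs introduced by coercion"; it is suppressed here for brevity)
revenue_num <- suppressWarnings(as.numeric(revenue_chr))

is.na(revenue_num)
```

If this assumption holds, the NA values mark entries that were empty in the source rather than values lost during parsing.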
#check which are the types?
unique(MC3_nodes_clean$type)
[1] "Company" "Company Contacts" "Beneficial Owner"
skim(MC3_nodes_clean)
| Name | MC3_nodes_clean |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
DT::datatable(MC3_nodes_clean)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
MC3_edges <- as_tibble(MC3$links) %>%
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source != target) %>%
  ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
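One point about the weights is worth checking on a small example. Because distinct() runs before the rows are counted, each (source, target, type) combination occurs only once by the time n() is evaluated, so every weight is 1. The sketch below uses a made-up edge table (`edges_toy` is hypothetical) to contrast that with counting the raw rows, which may or may not be the intended meaning of weights.

```r
library(dplyr)

# Hypothetical edge list with one duplicated link and one self-loop
edges_toy <- tibble(
  source = c("A", "A", "A", "B"),
  target = c("X", "X", "Y", "B"),
  type   = c("Beneficial Owner", "Beneficial Owner",
             "Company Contacts", "Company Contacts")
)

# Counting after distinct(): duplicates are already gone, so all weights are 1
after_distinct <- edges_toy %>%
  distinct() %>%
  filter(source != target) %>%
  count(source, target, type, name = "weights")

# Counting the raw rows: the duplicated A-X link gets weight 2
raw_counts <- edges_toy %>%
  filter(source != target) %>%
  count(source, target, type, name = "weights")

after_distinct$weights
raw_counts$weights
```

Here count() is shorthand for group_by() + summarise(n()) + ungroup(), so it also avoids the grouping message.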
#check missing value
colSums(is.na(MC3_edges))
 source  target    type weights
      0       0       0       0
There are no missing values in the edge data. Next, we explore the dataset.
DT::datatable(MC3_edges)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
#check which are the types?
unique(MC3_edges$type)
[1] "Company Contacts" "Beneficial Owner"
#datatable() of the DT package displays the MC3_edges tibble as an interactive table in the HTML document.
To find the business groups, we check the relationship types in the data. There might be owner-business, customer-business, or business-business relationships.
ggplot(data = MC3_edges,
aes(x = type)) +
geom_bar()
MC3_edges_clean <- MC3_edges %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>% #keep the edge type; the earlier version overwrote it with target
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source != target) %>%
  ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
Handling missing values: some product_services entries hold the placeholder string “character(0)” instead of text. We recode these values to NA before passing them to text sensing:
# Recode "character(0)" to NA in the product_services column
MC3_nodes_clean$product_services[MC3_nodes_clean$product_services == "character(0)"] <- NA
ggplot(data = MC3_nodes_clean,
aes(x = type)) +
geom_bar()
Building network model with tidygraph
id1 <- MC3_edges_clean %>%
select(source) %>%
rename(id = source)
id2 <- MC3_edges_clean %>%
select(target) %>%
rename(id = target)
MC3_nodes1 <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_nodes_clean,
            unmatched = "drop")
Joining with `by = join_by(id)`
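The node table above is rebuilt from the edge endpoints, presumably so that every source and target has a matching node row before the graph is built. A toy sketch of the same pattern follows. All names and values (`edges_toy`, `attrs_toy`, the `name` column) are hypothetical, and the node key is called `name` here because that is tidygraph's default `node_key`.

```r
library(dplyr)
library(tidygraph)

# Hypothetical edge list: a simple path A - B - C
edges_toy <- tibble(from = c("A", "B"), to = c("B", "C"))

# Attributes are known for only some nodes (C has none)
attrs_toy <- tibble(name = c("A", "B"),
                    type = c("Company", "Beneficial Owner"))

# Build the node list from the edge endpoints, then attach attributes
nodes_toy <- tibble(name = unique(c(edges_toy$from, edges_toy$to))) %>%
  left_join(attrs_toy, by = "name")

# Undirected graph with betweenness centrality per node
g <- tbl_graph(nodes = nodes_toy, edges = edges_toy, directed = FALSE) %>%
  mutate(betweenness = centrality_betweenness())

node_tbl <- g %>% activate(nodes) %>% as_tibble()
node_tbl
```

Node C still appears, with NA attributes, because it is an edge endpoint. B is the only node lying on a shortest path between two others, so it is the only one with non-zero betweenness.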
mc3_graph <- tbl_graph(nodes = MC3_nodes1,
                       edges = MC3_edges_clean,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

mc3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  #constant alpha and colour are set outside aes() so they are not treated as data mappings
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph()
Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

Text sensing with tidytext
word count
#start a bit of text sensing, display the result by max value first
MC3_nodes_clean <- MC3_nodes_clean %>%
mutate(n_fish = str_count(product_services, "fish")) %>%
arrange(desc(n_fish))
library(ggplot2)
# MC3_nodes_clean <- MC3_nodes_clean %>%
# group_by(type) %>%
# summarize(n_fish = sum(str_count(product_services, "fish")), .groups = "drop")
ggplot(data = MC3_nodes_clean, aes(x = type, y = n_fish)) +
  geom_bar(stat = "identity")
Warning: Removed 18959 rows containing missing values (`position_stack()`).
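Two properties of str_count() matter for the count above: the pattern is case-sensitive by default, and a missing input yields NA rather than 0, which is consistent with the rows removed in the warning. A minimal sketch on made-up strings (`descriptions` is hypothetical):

```r
library(stringr)

# Hypothetical product descriptions, including a missing one
descriptions <- c("Fish and fish products", "Canned tuna", NA)

# Default matching is case-sensitive: "Fish" is not counted
n_sensitive <- str_count(descriptions, "fish")

# regex(ignore_case = TRUE) counts both "Fish" and "fish"
n_insensitive <- str_count(descriptions, regex("fish", ignore_case = TRUE))

n_sensitive
n_insensitive
```

Whether capitalised occurrences should count is a modelling choice; the case-insensitive variant is shown only as an option.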

Tokenisation
In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Characters such as punctuation marks are discarded in the process.
The two basic arguments to unnest_tokens() used here are column names. First argument is the output column name that will be created as the text is unnested into it, and then the input column that the text comes from (product_services, in this case).
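As a small illustration of the two arguments, the sketch below tokenises a single made-up description (`toy` is hypothetical). By default unnest_tokens() splits into words, lowercases them, and drops punctuation.

```r
library(dplyr)
library(tidytext)

# One hypothetical record with mixed case and punctuation
toy <- tibble(id = "c1",
              product_services = "Fresh fish, frozen Fish & seafood.")

# Output column first (word), then the input column (product_services)
toy_tokens <- toy %>%
  unnest_tokens(word, product_services)

toy_tokens$word
```

The other columns (here `id`) are repeated for each token, so every word stays linked to the node it came from.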
nodesToken <- MC3_nodes_clean %>%
  unnest_tokens(word, product_services)
#can add in to_lower = TRUE
#add in strip_punct = TRUE

nodesToken %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  #after coord_flip() the x aesthetic (word) is drawn on the vertical axis
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")
Selecting by n

The result shows that the most frequent tokens are not informative, for example NA, “a” and “to”.
We therefore remove the common, generic words as stop words and also exclude the NA records.
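The stop-word step can be sketched on a handful of made-up tokens (`tokens_toy` is hypothetical). The `stop_words` table ships with tidytext and has a `word` column, so anti_join() keeps only the tokens that do not appear in it. The NA token is not a stop word, so it is filtered separately.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, including generic words and a missing value
tokens_toy <- tibble(word = c("fish", "and", "seafood", "to", NA))

kept <- tokens_toy %>%
  anti_join(stop_words, by = "word") %>%  # drop tokens listed as stop words
  filter(!is.na(word))                    # drop the NA token

kept$word
```

Filtering on the `word` column alone leaves rows untouched when other columns contain NA.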
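tidy_stopwords <- nodesToken %>%
  anti_join(stop_words) %>%
  #drop only rows whose token is NA; na.omit() would also drop rows with a missing revenue_omu
  filter(!is.na(word))
Joining with `by = join_by(word)`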
Visualization with a bar chart after removing stop words
tidy_stopwords %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  #after coord_flip() the x aesthetic (word) is drawn on the vertical axis
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")
Selecting by n

#use parallel coordinates to visualize
# library(cluster)
# library(caret)
#
# MC3_nodes <- MC3_nodes %>%
# select(product_services, country, revenue_omu, type) %>%
# na.omit()
#
#
#
# #prepare data
# clustering_data <- MC3_nodes[, c("product_services", "revenue_omu", "type")]
# #try K means clustering
# k <- 4 # Number of clusters
# set.seed(123) # For reproducibility
# kmeans_result <- kmeans(clustering_data, centers = k)
#
# MC3_nodes$cluster <- as.factor(kmeans_result$cluster)
# cluster_summary <- aggregate(clustering_data, by = list(cluster = MC3_nodes$cluster), FUN = mean)
#
#
#
# pacman::p_load(GGally, parallelPlot)
# library(GGally)
# ggparcoord(MC3_nodes[, c("product_services", "country", "revenue_omu", "type","cluster")],
# columns = 1:3, groupColumn = "cluster",
# title = "Parallel Coordinate Plot: Features by Cluster")
# plotting relationship?
TODO: this attempt failed and needs troubleshooting
# GraphMC3 <- tbl_graph(nodes = MC3_nodes_clean,
# edges = MC3_edges_clean,
# directed = FALSE)
# #
# GraphMC3
#is_connected <- is.connected(GraphMC3)
# peopleEntityRelationship %>%
# activate(edges) %>%
# arrange(desc(weightkg))